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Abstract 

The Tweedie GLM is a widely used method for predicting insurance premiums. 
However, the structure of the logarithmic mean is restricted to a linear form in the 
Tweedie GLM, which can be too rigid for many applications. As a better alternative, 
we propose a gradient tree-boosting algorithm and apply it to Tweedie compound 
Poisson models for pure premiums. We use a profile likelihood approach to estimate the 
index and dispersion parameters. Our method is capable of fitting a flexible nonlinear 
Tweedie model and capturing complex interactions among predictors. A simulation 
study confirms the excellent prediction performance of our method. As an application, 
we apply our method to an auto insurance claim data and show that the new method is 
superior to the existing methods in the sense that it generates more accurate premium 
predictions, thus helping solve the adverse selection issue. We have implemented our 
method in a user-friendly R package that also includes a nice visualization tool for 
interpreting the fitted model. 

* McGill University 

I Rochester Institute of Technology 

1 Corresponding author, zoux019@uinn.edu, University of Minnesota 


1 



1 Introduction 


One of the most important problems in insurance business is to set the premium for the 
customers (policyholders). In a competitive market, it is advantageous for the insurer to 
charge a fair premium according to the expected loss of the policyholder. In personal car 
insurance, for instance, if an insurance company charges too much for old drivers and charges 
too little for young drivers, then the old drivers will switch to its competitors, and the 
remaining policies for the young drivers would be underpriced. This results in the adverse 


selection issue (Dionne et ah, 2001): the insurer loses prohtable policies and is left with bad 
risks, resulting in economic loss both ways. 

To appropriately set the premiums for the insurer’s customers, one crucial task is to 
predict the size of actual (currently unforeseeable) claims. In this paper, we will focus on 
modeling claim loss, although other ingredients such as safety loadings, administrative costs, 
cost of capital, and proht are also important factors for setting the premium. One difficulty 
in modeling the claims is that the distribution is usually highly right-skewed, mixed with 
a point mass at zero. Such type of data cannot be transformed to normality by power 
transformation, and special treatment on zero claims is often required. As an example. 
Figure [^shows the histogram of an auto insurance claim data (Yip and Yau, 2005), in which 
there are 6,290 policy records with zero claims and 4,006 policy records with positive losses. 

The need for predictive models emerges from the fact that the expected loss is highly 
dependent on the characteristics of an individual policy such as age and motor vehicle record 
points of the policyholder, population density of the policyholder’s residential area, and age 


and model of the vehicle. Traditional methods used generalized linear models (GLM; Nelder 


and Wedderburn, 1972) for modeling the claim size (e.g. Renshaw, 1994 Haberman and 


Renshaw, 1996|). However, the authors of the above papers performed their analyses on a 


subset of the policies, which have at least one claim. Alternative approaches have employed 


Tobit models by treating zero outcomes as censored below some cutoff points (Van de Ven 


and van Praag, 1981 Showers and Shotick, 1994), but these approaches rely on a normality 


assumption of the latent response. Alternatively, Jprgensen and de Souza (1994) and Smyth 


and Jprgensen (2002) used GLMs with a Tweedie distributed outcome to simultaneously 


model frequency and severity of insurance claims. They assume Poisson arrival of claims and 
gamma distributed amount for individual claims so that the size of the total claim amount 
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Figure 1: Histogram of the auto insurance claim data as analyzed in Yip and Yau (2005). 


It shows that there are 6290 policy records with zero total claims per policy year, while the 
remaining 4006 policy records have positive losses. 


follows a Tweedie compound Poisson distribution. Due to its ability to simultaneously model 
the zeros and the continuous positive outcomes, the Tweedie GLM has been a widely used 


method in actuarial studies (Mildenhall, 1999 Murphy et ah, 2000 Peters et ah, 2008). 


Despite of the popularity of the Tweedie GLM, a major limitation is that the structure of 
the logarithmic mean is restricted to a linear form, which can be too rigid for real applications. 
In auto insurance, for example, it is known that the risk does not monotonically decrease as 
age increases (|Anstey et ah 2005). Although nonlinearity may be modeled by adding splines 


(Zhang, 2011), low-degree splines are often inadequate to capture the non-linearity in the 


data, while high-degree splines often result in the over-htting issue that produces unstable 


estimates. Generalized additive models (GAM; Hastie and Tibshirani, 1990 Wood, 2006) 


overcome the restrictive linear assumption of GLMs, and can model the continuous variables 
by smooth functions estimated from data. The structure of the model, however, has to be 
determined a priori. That is, one has to specify the main effects and interaction effects to be 
used in the model. As a result, misspecihcation of non-ignorable effects is likely to adversely 
affect prediction accuracy. 
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In this paper, we aim to model the insurance claim size by a nonparametric Tweedie com¬ 
pound Poisson model, and propose a gradient tree-boosting algorithm (TDboost henceforth) 
to fit this model. To our knowledge, before this work, there is no existing nonparametric 
Tweedie method available. Additionally, we also implemented the proposed method as an 
easy-to-use R package, which is publicly available. 

Gradient boosting is one of the most successful machine learning algorithms for nonpara¬ 
metric regression and classification. Boosting adaptively combines a large number of rela¬ 
tively simple prediction models called base learners into an ensemble learner to achieve high 


prediction performance. The seminal work on the boosting algorithm called AdaBoost (Fre¬ 


und and Schapire, 1997) was originally proposed for classification problems. Later Breiman 


(1998) and Breiman (|1999) pointed out an important connection between the AdaBoost al¬ 


gorithm and a functional gradient descent algorithm. Friedman et ah (2000) and Hastie et ah 


(2009) developed a statistical view of boosting and proposed gradient boosting methods for 
both classification and regression. There is a large body of literature on boosting. We refer 


interested readers to Biihlmann and Hothorn (2007) for a comprehensive review of boosting 
algorithms. 

The TDboost model is motivated by the proven success of boosting in machine learning 


for classification and regression problems (Friedman, 2001, 2002 Hastie et ah, 2009). Its 
advantages are threefold. First, the model structure of TDboost is learned from data and 
not predetermined, thereby avoiding an explicit model specification. Non-linearities, discon¬ 
tinuities, complex and higher order interactions are naturally incorporated into the model 
to reduce the potential modeling bias and to produce high predictive performance, which 
enables TDboost to serve as a benchmark model in scoring insurance policies, guiding pricing 
practice, and facilitating marketing efforts. Feature selection is performed as an integral part 
of the procedure. In addition, TDboost handles the predictor and response variables of any 
type without the need for transformation, and it is highly robust to outliers. Missing values 


in the predictors are managed almost without loss of information (Elith et ah, 2008). All 
these properties make TDboost a more attractive tool for insurance premium modeling. On 
the other hand, we acknowledge that its results are not as straightforward as those from the 
Tweedie GLM model. Nevertheless, TDboost does not have to be regarded as a black box. 
It can provide interpretable results, by means of the partial dependence plots, and relative 
importance of the predictors. 
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The remainder of this paper is organized as follows. We briefly review the gradient 
boosting algorithm and the Tweedie compound Poisson model in Section and Section 
respectively. We present the main methodological development with implementation details 
in Section]^ In Sectionj^ we use simulation to show the high predictive accuracy of TDboost. 
As an application, we apply TDboost to analyze an auto insurance claim data in Section 


2 Gradient Boosting 


Gradient boosting (Friedman, 2001) is a recursive, nonparametric machine learning algo¬ 


rithm that has been successfully used in many areas. It shows remarkable flexibility in 
solving different loss functions. By combining a large number of base learners, it can handle 
higher order interactions and produce highly complex functional forms. It provides high 
prediction accuracy and often outperforms many competing methods, such as linear regres¬ 


sion/classihcation, bagging (Breiman, 1996), splines and CART (Breiman et ah, 1984). 

To keep the paper self-contained, we briefly explain the general procedures for the gra¬ 
dient boosting. Let x = (xi,... be a p-dimensional column vector for the predictor 

variables and y be the one-dimensional response variable. The goal is to estimate the opti¬ 
mal prediction function F[-) that maps x to ?/ by minimizing the expected value of a loss 
function •) over the function class 


F(-) = 

where T is assumed to be differentiable with respect to F. Given the observed data 
{?/i,where Xj = {xn,... ,Xipy, estimation of F(-) can be done by minimizing the 
empirical risk function 

1 JL 

( 1 ) 


i=l 

For the gradient boosting, each candidate function F G is assumed to be an ensemble of 
M base learners 

M 

F(x) = F™ + (2) 


m=l 


where h(x; usually belongs to a class of some simple functions of x called base learners 
(e.g., regression/decision tree) with the parameter (m = 1, 2, ■ ■ ■ , M). F^®] is a constant 
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scalar and is the expansion coefficient. Note that differing from the nsnal strnctnre of 
an additive model, there is no restriction on the nnmber of predictors to be inclnded in each 
h{-), and conseqnently, high-order interactions can be easily considered nsing this setting. 

A forward stagewise algorithm is adopted to approximate the minimizer of Q, which 
bnilds np the components (m = 1, 2,..., M) seqnentially throngh a gradient- 

descent-like approach. At each iteration stage m (m = 1 , 2 ,...), snppose that the cnrrent 
estimate for F{-) is To npdate the estimate from to the gradient 

boosting fits a negative gradient vector (as the working response) to the predictors nsing 
a base learner This fitted can be viewed as an approximation of the 

negative gradient. Snbseqnently, the expansion coefficient can then be determined by 
a line search minimization with the empirical risk fnnction, and the estimation of T’(x) for 
the next stage becomes 


FH(x) := F[”*-^](x) + 


(3) 


where 0 < z/ < 1 is the shrinkage factor (Friedman, 2001) that controls the npdate step 


size. A small v imposes more shrinkage while v = 1 gives complete negative gradient steps. 


Friedman (2001) has fonnd that the shrinkage factor reduces over-fitting and improves the 


predictive accnracy. 


3 Compound Poisson Distribution and Tweedie Model 

In insnrance preminm prediction problems, the total claim amonnt for a covered risk nsnally 
has a continnons distribntion on positive valnes, except for the possibility of being exact zero 
when the claim does not occnr. One standard approach in actnarial science in modeling snch 
data is nsing Tweedie componnd Poisson models, which we briefly introdnce in this section. 

Let be a Poisson random variable denoted by Pois(A), and let Z^’s (d = 0,1,..., iV) 
be i.i.d. gamma random variables denoted by Gamma(a, 7 ) with mean ay and variance ay^. 
Assnme N is independent of Z^’s. Define a random variable Z by 

{ 0 a N = 0 

(4) 

Zi -\- Z 2 Ztv if = 1 , 2 ,... 
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Thus Z is the Poisson sum of independent Gamma random variables. In insurance appli¬ 
cations, one can view Z as the total claim amount, iV as the number of reported claims 
and Zrf’s as the insurance payment for the dth claim. The resulting distribution of Z is 


referred to as the compound Poisson distribution (Jprgensen and de Souza, 1994; Smyth and 


Jprgensen, 2002), which is known to be closely connected to exponential dispersion models 


(EDM) (Jprgensen, [1987 ). Note that the distribution of Z has a probability mass at zero: 
PriZ = 0) = exp(—A). Then based on that Z conditional on iV = j is Gamma(jQ;, 7 ), the 
distribution function of Z can be written as 


fz{z\\, a, 7 ) = Pr{N = 0)do{z) + ^ Pr{N = j)fz\N=j{z) 


1=1 


exp(-A)cio(^) + ^ 


Aig-A 

^ j! 7^“P(ja) ’ 


where do is the Dirac delta function at zero and fz\N=j is the conditional density of Z given 
N = j. Smyth (1996| pointed out that the compound Poisson distribution belongs to a 


special class of EDMs known as Tweedie models (Tweedie, 1984), which are dehned by the 
form 

fz{z\9,(p) = (5) 

where a(-) is a normalizing function, k{-) is called the cumulant function, and both a(-) 
and k{-) are known. The parameter 6 * is in M and the dispersion parameter 0 is in IR+. 
For Tweedie models the mean E{Z) = /r = k{6) and the variance Var(Z’) = (j)k{6), where 
k{9) and k{9) are the hrst and second derivatives of k{9), respectively. Tweedie models 
have the power mean-variance relationship Var(Z) = for some index parameter p. Such 
mean-variance relation gives 




log/i, p=l 


<9) = 


p-p 

2-p ■ 


p^2 


log/i, p = 2 


( 6 ) 


One can show that the compound Poisson distribution belongs to the class of Tweedie 
models. Indeed, if we reparametrize (A, a, 7 ) by 


1 ^ 

02-p’ 


a = 


2 -P 

P-1’ 


7 = (\){p-\)pP \ 


(7) 
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the compound Poisson model will have the form of a Tweedie model with 1 < p < 2 and 
p > 0. Asa result, for the rest of this paper, we only consider the model Q, and simply refer 
to Q as the Tweedie model (or Tweedie compound Poisson model), denoted by Tw(/r, 0, p), 
where 1 < p < 2 and p > 0. 

It is straightforward to show that the log-likelihood of the Tweedie model is 

log/z(^|p,0,p) = -^) +loga(^,0,p), (8) 

(p\ 1- p 2- pj 

where the normalizing function a(-) can be written as 


a{z,(j),p) 


1 


1 _Py_ 

2 (p-i)‘“,ji‘(i+“)(2-p)U!rpo) 


for > 0 
for z = 0 


and a = (2 — p)/(p — 1) and 


(Tweedie, 1984). 


Wt is an example of Wright’s generalized Bessel function 


4 Our Proposal 


In this section, we propose to integrate the Tweedie model to the tree-based gradient boosting 
algorithm to predict insurance claim size. Specihcally, our discussion focuses on modeling 
the personal car insurance as an illustrating example (see Section]^ for a real data analysis), 
since our modeling strategy is easily extended to other lines of non-life insurance business. 

Given an auto insurance policy i, let W be the number of claims (known as the claim 
frequency) and be the size of each claim observed for dj = 1,..., W- Let Wi be the policy 
duration, that is, the length of time that the policy remains in force. Then Zi = '^Z=i 
is the total claim amount. In the following, we are interested in modeling the ratio between 
the total claim and the duration Y) = Zj/tCj, a key quantity known as the pure premium 


(Ohlsson and Johansson, 2010). 


Following the settings of the compound Poisson model, we assume W is Poisson dis¬ 
tributed, and its mean \iWi has a multiplicative relation with the duration tCj, where A* is 
a policy-specihc parameter representing the expected claim frequency under unit duration. 
Conditional on W, assume (dj = 1,...,W) are i.i.d. Gamma(a, 7 j), where is a 











policy-specific parameter that determines claim severity, and a is a constant. Furthermore, 
we assume that under unit duration (i.e., Wi = 1), the mean-variance relation of a policy 
satishes Var{Y*) = cl)[E{Y*)Y for all policies, where Y* is the pure premium under unit du¬ 
ration, 0 is a constant, and p = (Q!-|-2)/(a-l-l). Then, it is known that Tw(/ri,0/M;i,p), 
the details of which are provided in Appendix Part A. 

Then, we consider a portfolio of policies {(?/*, Xj, from n independent insurance 

contracts, where for the Ah contract, yi is the policy pure premium, Xj is a vector of ex¬ 
planatory variables that characterize the policyholder and the risk being insured (e.g. house, 
vehicle), and Wi is the duration. Assume that the expected pure premium pi is determined 
by a predictor function F : —)■ M of xp 


■ogifii} = >og{£’(y'i|x.)} = F(Xi). 


(9) 


In this paper, we do not impose a linear or other parametric form restriction on F[-). Given 
the flexibility of F{-), we call such setting as the boosted Tweedie model (as opposed to the 
Tweedie GLM). Given Xj, the log-likelihood function can be written as 

n 

= ^\og fY{yi\Pi, (j)/wi,p), 
i=l 

n / l-p 2-p \ 

= -^—) +loga(2/i,0M,p). (10) 

(j) \ l-p 2-pJ 


4.1 Estimating F( ) via TDboost 

We estimate the predictor function F{-) by integrating the boosted Tweedie model into the 
tree-based gradient boosting algorithm. To develop the idea, we assume that 0 and p are 
given for the time being. The joint estimation of F"(-), 0 and p will be studied in Section 
021 

Given p and 0, we replace the general objective function in ([^ by the negative log- 
likelihood derived in (10), and target the minimizer function F*{-) over a class F of base 
learner functions in the form of (|^. That is, we intend to estimate 


p* 


,x = 


argmin{ - ^(F(-), 0, p|{j/i, x^, = argmin ^ T(?/i, F(xi)|p), (11) 






i=l 
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where 


,T,^ M ^ ; l/iexp 1-pFxi exp 2 - p F 

^{yi,F{xi) p) = Wi<{ -^-+--- 

1-p 2-p 

Note that in contrast to ( [II| ), the function class targeted by Tweedie GLM (Smyth, [l996 ) is 
restricted to a collection of linear functions of x. 

We propose to apply the forward stagewise algorithm described in Section for solving 
(0. The initial estimate of F*{-) is chosen as a constant function that minimizes the 
negative log-likelihood: 


= argmin p I p) 


2=1 


log 


E n 

i=l '^iVi 


E n 

i=iWi 


This corresponds to the best estimate of F without any covariates. Let be the current 

estimate before the mth iteration. At the mth step, we £t a base learner h(x;^^™'^) via 


I”" = argniin^[M[™' - h(xi; 

^[m] -i—^ 


2=1 


( 12 ) 


where is the current negative gradient of T(- | p), i.e., 


M = 


aT(pi,F(xi) I p) 


dFi'x.i) 

F(xi)=F[™-il(xP 

w,{ - y, exp[(l - p)Fl—il(xi)] + exp[(2 - p)Fl—^^(xi)]}, 


(13) 

(14) 


and use an L-terminal node regression tree 




1 = 1 


(15) 


with parameters as the base learner. To hnd and mI™', we use a 


fast top-down “best-fit” algorithm with a least squares splitting criterion (Friedman et al. 


2000) to hnd the splitting variables and corresponding split locations that determine the 


htted terminal regions Note that estimating the entails estimating the uf 
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as the mean falling in each region: 


-[ml / [r? 

u\ ' = mean, ^-[m] (u) 


/ — 1,..., L. 


Once the base learner has been estimated, the optimal value of the expansion 

coefficient is determined by a line search 


= argmiti J]] F'"' ''(x.) +I rt 


(16) 


i=l 


argmin^d'(|/j,F[™ ^'(x*) +/3^/(xj e | p). 

^ 1^1 


1=1 


The regression tree (15) predicts a constant value u] within each region R) , so we can solve 


(16) by a separate line search performed within each respective region Rl . The problem 


(16) reduces to hnding a best constant rji to improve the current estimate in each region 
Rf^^ based on the following criterion: 


Vi 


argmin ^ ^{yi, + r] \ p), 1 = 1,...,L, 




where the solution is given by 


^ m 1 

Vi = log 


(17) 


'WiViexpKl - ^^(xi)] 

'«^iexp[(2 - p)FN-h(xi)] /’ 




(18) 


Having found the parameters we then update the current estimate ^K^) 

in each corresponding region 


f IH(x) = fI'“-i1(x) + € Sr'), I = 1,...,L, 


(19) 


where 0 < i/ < 1 is the shrinkage factor. Following (Friedman, 2001), we set u = 0.005 in 


our implementation. More discussions on the choice of tuning parameters are in Section 4.4 


In summary, the complete TDboost algorithm is shown in Algorithm [Tj The boosting 
step is repeated M times and we report F[^1 (x) as the hnal estimate. 
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Algorithm 1 TDboost 

1. Initialize 


= log 


r 


2. For m = 1,..., M repeatedly do steps 2.(a)-2.(d) 

2.(a) Compute the negative gradient {u^^\ ... jtilTk 


m 

M- = 


w,{ - |/.exp[(l - p)F[”^-ExO] + exp[(2 - p)F^^-^\^,)]} t = 


2.(b) Fit the negative gradient vector ,..., Un y to (xi,..., x„)''' by an L-terminal 
node regression tree, where Xj = {xn,... jXipY for i = 1,... ,n, giving us the 
partitions 

2.(c) Compute the optimal terminal node predictions r]f^^ for each region / = 


1,2,...,L 


_ f Ex*)] | 

Ei:x,eFt™'^*®^p[E-p)E™-^Kx*)] j 
2.(d) Update F[™1(x) for each region i?E, / = 1,2,..., L 

fH(x) = f[™-P(x) + z/k’f'E e .rE) / = 1,2,... ,L. 


3. Report as the hnal estimate. 
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4.2 Estimating (/?, 0) via profile likelihood 


Following Dunn and Smyth] ( |2005 ), we use the profile likelihood to estimate the dispersion 
(j) and the index parameter p, which jointly determine the mean-variance relation V ar{Yi) = 
(j)p1/wi of the pure premium. We exploit the fact that in Tweedie models the estimation 
of p depends only on p\ given a fixed p, the mean estimate p*{p) can be solved in ( [TT] ) 
without knowing 0. Then conditional on this p and the corresponding p*{p), we maximize 
the log-likelihood function with respect to 0 by 


0*(p) = argmax{£(p*(p),0,p)}, 
0 


( 20 ) 


which is a univariate optimization problem that can be solved using a combination of golden 


section search and successive parabolic interpolation (Brent, 2013). In such a way, we have 
determined the corresponding (p*(p), (f>*{p)) for each fixed p. Then we acquire the estimate of 
p by maximizing the profile likelihood with respect to 50 equally spaced values {pi,..., pso} 
on (0,1): 

p* = argmax {£(p*(p),0*(p),p)}. (21) 

pe{pi,...,P5o} 


Finally, we apply p* in (11) and (20) to obtain the corresponding estimates p*{p*) and 0*(p* 


Some additional computational issues for evaluating the log-likelihood functions in (20) and 


(21) are discussed in Appendix Part B. 


4.3 Model interpretation 

Compared to other nonparametric statistical learning methods such as neural networks and 
kernel machines, our new estimator provides interpretable results. In this section, we discuss 
some ways for model interpretation after fitting the boosted Tweedie model. 

4.3.1 Marginal effects of predictors 

The main effects and interaction effects of the variables in the boosted Tweedie model can 
be extracted easily. In our estimate we can control the order of interactions by choosing the 
tree size L (the number of terminal nodes) and the number p of predictors. A tree with L 
terminal nodes produces a function approximation of p predictors with interaction order of 
at most min(L — l,p). For example, a stump (L = 2) produces an additive TDboost model 


13 










with only the main effects of the predictors, since it is a function based on a single splitting 
variable in each tree. Setting L = 3 allows both main effects and second order interactions. 


Following Friedman (2001) we use the so-called partial dependence plots to visualize the 


main effects and interaction effects. Given the training data {i/i, with a p-dimensional 

input vector x = (xi, X 2 , ■ ■ ■, XpY, let Zg be a subset of size s, such that Zg = {zi, ..., Zg} C 
{xi,..., a:p}. For example, to study the main effect of the variable j, we set the subset 
Zg = {zj}, and to study the second order interaction of variables i and j, we set Zg = {zi, Zj}. 
Let z\g be the complement set of Zg, such that z\g U Zg = {xi,... ,Xp}. Let the prediction 
F(zs|z\s) be a function of the subset Zg conditioned on specihc values of z^^- The partial 
dependence of F’(x) on Zg then can be formulated as F{zg\z\g) averaged over the marginal 
density of the complement subset z\g 


Fg{zg) = / F{zg\z\g)p\g{z\g)dz\g, 


( 22 ) 


where p\g{z\g) = Jp{'x)dzs is the marginal density of z^^. We estimate (22) by 


Fg{zg) = -y^ F{zg\z\g^i), 

n 


(23) 


Z=1 


where {z\s,i}f=i are evaluated at the training data. We then plot Fs{zg) against Zg. We 
have included the partial dependence plot function in our R package “TDboost”. We will 
demonstrate this functionality in Section 


4.3.2 Variable importance 

In many applications identifying relevant predictors of the model in the context of tree- 
based ensemble methods is of interest. The TDboost model dehnes a variable importance 
measure for each candidate predictor Xj in the set X = {Xi,... ,Xp} in terms of predic¬ 
tion/explanation of the response Y. The major advantage of this variable selection pro¬ 
cedure, as compared to univariate screening methods, is that the approach considers the 
impact of each individual predictor as well as multivariate interactions among predictors 
simultaneously. 

We start by dehning the variable importance (VI henceforth) measure in the context of 


a single tree. First introduced by Breiman et ah (1984), the VI measure IxATm) of the 
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variable Xj in a single tree is defined as the total heterogeneity reduction of the response 
variable Y produced by Xj, which can be estimated by adding up all the decreases in the 
squared error reductions 6i obtained in all L — 1 internal nodes when Xj is chosen as the 
splitting variable. Denote v{Xj) = I the event that Xj is selected as the splitting variable in 
the internal node I, and let Iji = I{v{Xj) = 1). Then 

L-l 

= (24) 

i=i 


where 6i is defined as the squared error difference between the constant fit and the two 
sub-region fits (the sub-region fits are achieved by splitting the region associated with the 


internal node I into the left and right regions). [Friedman (2001) extended the VI measure 


for the boosting model with a combination of M regression trees, by averaging (D^ over 
{Ti,..., Tm}: 


M 




(25) 


m=l 


Despite of the wide use of the VI measure, Breiman et ah (1984) and White and Liu (1994) 


among others have pointed out that the VI measures ( |24[ ) and (25) are biased: even if Xj is 
a non-informative variable to Y (not correlated to V), Xj may still be selected as a splitting 


variable, hence the VI measure of Xj is non-zero by Equation (25). Following Sandri and 


Zuccolotto (2008) and Sandri and Zuccolotto (2010) to avoid the variable selection bias, in 


this paper we compute an adjusted VI measure for each explanatory variable by permutating 
each Xj, the computational details are provided in Appendix Part C. 


4.4 Implementation 


We have implemented our proposed method in an R package “TDboost”, which is publicly 
available from the Comprehensive R Archive Network at http://cran.r-project.org/ 
web/packages/TDboost/index.html, Here, we discuss the choice of three meta parameters 
in Algorithmic L (the size of the trees), u (the shrinkage factor) and M (the number of 
boosting steps). 

To avoid over-fitting and improve out-of-sample predictions, the boosting procedure can 


be regularized by limiting the number of boosting iterations M (early stopping; Zhang and 
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Yu, 2005) and the shrinkage factor v. Empirical evidence (Friedman, 2001; Biihlmann and 


Hothorn, 2007 Ridgeway, 2007) showed that the predictive accuracy is almost always better 


with a smaller shrinkage factor at the cost of more computing time. However, smaller values 
of V usually requires a larger number of boosting iterations M and hence induces more 


computing time (Friedman, 2001). We choose a “sufficiently small” u = 0.005 throughout 


and determine M by the data. 

The value L should reflect the true interaction order in the underlying model, but we 
almost never have such prior knowledge. Therefore we choose the optimal M and L using K- 
fold cross validation, starting with a hxed value of L. The data are split into K roughly equal¬ 
sized folds. Let an index function tt{i) : {1,..., n} h->• {1,..., 77} indicate the fold to which 
observation i is allocated. Each time, we remove the fcth fold of the data [k = 1, 2,..., 77), 
and train the model using the remaining 77—1 folds. Denoting by (x) the resulting 
model, we compute the validation loss by predicting on each 7th fold of the data removed: 


CV(M, L) = -Y, '!'(!/!■ -F'7(.,(x.; i) U). 

•I . . 

1=1 


(26) 


We select the optimal M at which the minimum validation loss is reached 

Ml = argrmn CV(M, L). 

If we need to select L too, then we repeat the whole process for several L (e.g. L = 2,3, 4, 5) 
and choose the one with the smallest minimum generalization error 

L = argrmn CV{L, Ml). 

For a given u, htting trees with higher L leads to smaller M being required to reach the 


minimum error. 


5 Simulation Studies 


In this section, we compare TDboost with the Tweedie GLM model (TGLM: Jprgensen 


and de Souza, 1994) and the Tweedie GAM model in terms of the function estimation 


performance. The Tweedie GAM model is proposed by Wood (2001), which is based on a 
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penalized regression spline approach with automatic smoothness selection. There is an R 
package “MGCV” accompanying the work, available at http://cran.r-project.org/web/ 
packages/mgcv/index .html. In all numerical examples below using the TDboost model, 
hve-fold cross validation is adopted for selecting the optimal (M, L) pair, while the shrinkage 
factor z/ is set to its default value of 0.005. 

5.1 Case I 

In this simulation study, we demonstrate that TDboost is well suited to £t target functions 
that are non-linear or involve complex interactions. We consider two true target functions: 

• Model 1 (Discontinuous function): The target function is discontinuous as dehned by 
F{x) = 0.5/(x > 0.5). We assume x ~ Unif(0,l), and y ~ Tw(/i, 0, p) with p = 1.5 
and (j) = 0.5. 

• Model 2 (Complex interaction): The target function has two hills and two valleys. 

F{xi,X2) = 

which corresponds to a common scenario where the effect of one variable changes 
depending on the effect of another. We assume Xi,X 2 ~ Unif(0,1), and y Tw(/i,0,p) 
with p = 1.5 and 0 = 0.5. 

We generate n = 1000 observations for training and n' = 1000 for testing, and fit the 
training data using TDboost, MGCV, and TGLM. Since the true target functions are known, 
we consider the mean absolute deviation (MAD) as performance criteria, 

n' 

MAD = iV|F(x,)-F(x,)|, 
n' 

i=\ 

where both the true predictor function T’(xj) and the predicted function i^(xj) are evaluated 
on the test set. The resulting MADs on the testing data are reported in Table which are 
averaged over 100 independent replications. The htted functions from Model 2 are plotted 
in Figure In both cases, we hud that TDboost outperforms TGLM and MGCV in terms 
of the ability to recover the true functions and gives the smallest prediction errors. 
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(a) True F{xi,X2) 


(b) TDboost F(xi,X2) 



F(xi, 



F(xi, 


(c) TGLM F{x2,X2) 


(d) MGCV F(xi,X 2 ) 




F(xu 


T(ii, 


Figure 2: Fitted curves that recover the target function dehned in Model 2. The top left 
hgure shows the true target function. The top right, bottom left, and bottom right hgures 
show the predictions on the testing data from TDboost, TGLM, and MGCV, respectively. 
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Model 

TGLM 

MGGV 

TDboost 

1 

0.1102 (0.0006) 

0.0752 (0.0016) 

0.0595 (0.0021) 

2 

0.3516 (0.0009) 

0.2511 (0.0004) 

0.1034 (0.0008) 


Table 1: The averaged MADs and the corresponding standard errors based on 100 indepen¬ 
dent replications. 


5.2 Case II 


The idea is to see the performance of the TDboost estimator and MGCV estimator on a 
variety of very complicated, randomly generated predictor fnnctions, and stndy how the size 
of the training set, distribution settings and other characteristics of problems affect final 
performance of the two methods. We use the “random function generator” (RFG) model by 


Friedman (2001) in our simulation. The true target function F is randomly generated as a 
linear expansion of functions {gk}^=i- 
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F(x) ^ ybkgkizik). 


(27) 


k=l 


Here each coefficient bk is a uniform random variable from Unif[—1,1]. Each gk{zk) is a 
function of z^, where Zk is defined as a p^-sized subset of the ten-dimensional variable x in 
the form 

Zfc = ( 28 ) 

where each ^l!k is an independent permutation of the integers {1,... ,p}. The size pk is ran¬ 
domly selected by min([2.5 -1- r^J ,p), where Vk is generated from an exponential distribution 
with mean 2. Hence the expected order of interactions presented in each gki^k) is between 
four and five. Each function gki^k) is a p^-dimensional Gaussian function: 


gk{zk) = exp 



Ufc)Wfc(zfc - Ufc)|, 


(29) 


where each mean vector is randomly generated from N(0,lpj,). The pk x pk covariance 
matrix is defined by 

Vfc = UfcDfcU^, (30) 

where Ufc is a random orthonormal matrix, = dmpjdfc)!],..., dfciPfc]}, and the square 
root of each diagonal element is a uniform random variable from Unif [0.1, 2.0]. We 
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generate data according to 


l/i ~ Tw(/ii,0,p), Xi~N(0,Ip), i = 


(31) 


where Hi = exp{F(xj)}. 

Setting I: when the index is known 

Firstly, we study the situation that the true index parameter p is known when htting models. 
We generate data according to the RFG model with index parameter p = 1.5 and the 
dispersion parameter 0 = 1 in the true model. We set the number of predictors to be p = 10 
and generate n G {1000,2000,5000} observations as training sets, on which both MGCV 
and TDboost are htted with p specihed to be the true value 1.5. An additional test set of 
n' = 5000 observations was generated for evaluating the performance of the final estimate. 

Figure shows simulation results for comparing the estimation performance of MGGV 
and TDboost, when varying the training sample size. The empirical distributions of the 
MADs shown as box-plots are based on 100 independent replications. We can see that in all 
of the cases, TDboost outperforms MGGV in terms of prediction accuracy. 

We also test estimation performance on p when the index parameter p is misspecihed, 
that is, we use a guess value p differing from the true value p when htting the TDboost model. 
Because p is statistically orthogonal to 0 and p, meaning that the off-diagonal elements of 


the Fisher information matrix are zero (Jprgensen, 1997), we expect p will vary very slowly 


as p changes. Indeed, using the previous simulation data with the true value p = 1.5 and 
0 = 1, we htted TDboost models with nine guess values of p G (1.1,1.2,..., 1.9}. The 
resulting MADs are displayed in Figure]^ which shows the choice of the value p has almost 
no signihcant ehect on estimation accuracy of p. 


Setting II: using the estimated index 

Next we study the situation that the true index parameter p is unknown, and we use the 


estimated p obtained from the prohle likelihood procedure discussed in Section 4^ for htting 
the model. The same data generation scheme is adopted as in Setting I, except now both 
MGGV and TDboost are htted with p estimated by maximizing the prohle likelihood. Figure 
[^shows simulation results for comparing the estimation performance of MGGV and TDboost 
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MGCV TDBoost MGCV TDBoost MGGV TDBoost 

Method 


Figure 3: Simulation results for Setting I: compare the estimation performance of MGCV 
and TDboost when varying the training sample size and the dispersion parameter in the 
true model. Box-plots display empirical distributions of the MADs based on 100 independent 
replications. 



1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 


P* 

Figure 4: Simulation results for Setting I when the index is misspecihed: the estimation per¬ 
formance of TDboost when varying the value of the index parameter p G {1.1,1.2,..., 1.9}. 
In the true model p = 1.5 and 0 = 1. Box-plots show empirical distributions of the MADs 
based on 200 independent replications. 
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Figure 5: Simulation results for Setting II: compare the estimation performance of MGCV 
and TDboost when varying the training sample size and the dispersion parameter in the 
true model. Box-plots display empirical distributions of the MADs based on 100 independent 
replications. 

in such setting. We can see that the results have no significant difference to the results of 
Setting I: TDboost still outperforms MGCV in terms of prediction accuracy when using the 
estimated p instead of the true value. 

Lastly, we demonstrate our results from the estimation of the dispersion 0 and the index 
p by using the prohle likelihood. A total number of 200 sets of training samples are randomly 


generated from a true model according to the setting (31) with 0 = 2 and p = 1.7, each 


sample having 2000 observations. We fit the TDboost model on each sample and compute the 
estimates 0* at each of the 50 equally spaced values {pi,..., pso} on (1, 2). The (p^, 0*(pj)) 
corresponding to the maximal profile likelihood is the estimate of (p, 0). The estimation 
process is repeated 200 times. The estimated indices have mean p* = 1.68 and standard 
error SE{p*) = 0.026, so the true value p = 1.7 is within p* ± SE{p*). The estimated 
dispersions have mean 0* = 1.82 and standard error SE{(j)*) = 0.12. Figure shows the 
prohle likelihood function of p for a single run. 
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Figure 6: The curve represents the prohle likelihood function of p from a single run. The 
dotted line shows the true value p = 1.7. The solid line shows the estimated value p* = 1.68 
corresponding to the maximum likelihood. The associated estimated dispersion is 0* =1.89. 


6 Application: Automobile Claims 


6.1 Dataset 


We consider an auto insurance claim dataset as analyzed in Yip and Yau (2005) and Zhang 


and Yu (2005). The data set contains 10,296 driver vehicle records, each record including an 


individual driver’s total claim amount (zi) in the last hve years {wi = 5) and 17 characteristics 
Xi = ... ,Xi^n) for the driver and the insured vehicle. We want to predict the expected 

pure premium based on x*. Table summarize the data set. The descriptive statistics of 
the data are provided in Appendix Part D. The histogram of the total claim amounts in 
Figure shows that the empirical distribution of these values is highly skewed. We hnd 
that approximately 61.1% of policyholders had no claims, and approximately 29.6% of the 
policyholders had a positive claim amount up to 10,000 dollars. Note that only 9.3% of 
the policyholders had a high claim amount above 10,000 dollars, but the sum of their claim 
amount made up to 64% of the overall sum. Another important feature of the data is 
that there are interactions among explanatory variables. For example, from Table we can 
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AREA 

REVOKED 

No 

Yes 

Urban 

3150.57 

14551.62 

Rural 

904.70 

7624.36 

Difference 


11401.05 

6719.66 


Table 2: The averaged total claim amount for different categories of the policyholders. 


ID 

Variable 

Type 

Description 

1 

AGE 

N 

Driver’s age 

2 

BLUEBOOK 

N 

Value of vehicle 

3 

HOMEKIDS 

N 

Number of children 

4 

KIDSDRIV 

N 

Number of driving children 

5 

MVR_PTS 

N 

Motor vehicle record points 

6 

NPOLICY 

N 

Number of policies 

7 

RETAINED 

N 

Number of years as a customer 

8 

TRAVTIME 

N 

Distance to work 

9 

AREA 

G 

Home/work area: Rural, Urban 

10 

CAR_USE 

G 

Vehicle use: Gommercial, Private 

11 

CAR_TYPE 

G 

Type of vehicle: Panel Truck, Pickup, 

Sedan, Sports Gar, SUV, Van 

12 

GENDER 

G 

Driver’s gender: E, M 

13 

JOBGLASS 

G 

Unknown, Blue Collar, Clerical, Doctor, 

Home Maker, Lawyer, Manager, Professional, Student 

14 

MAX_EDUG 

G 

Education level: High School or Below, Bachelors, 

High School, Masters, PhD 

15 

MARRIED 

G 

Married or not: Yes, No 

16 

REVOKED 

G 

Whether license revoked in past 7 years: Yes, No 


Table 3: Explanatory variables in the claim history data set. Type N stands for numerical 
variable, Type C stands for categorical variable. 


see that the marginal effect of the variable REVOKED on the total claim amount is much 
greater for the policyholders living in the urban area than those living in the rural area. The 
importance of the interaction effects will be conhrmed later in our data analysis. 


6.2 Models 

We separate the entire dataset into a training set and a testing set with equal size. Then 
the TDboost model is htted on the training set and tuned with hve-fold cross validation. 
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For comparison, we also fit TGLM and MGCV, both of which are htted using all the ex¬ 
planatory variables. In MGGV, the numerical variables AGE, BLUEBOOK, HOMEKIDS, 
KIDSDRIV, MVR_PTS, NPOLIGY, RETAINED and TRAVTIME are modeled by smooth 
terms represented using penalized regression splines. We hnd the appropriate smoothness 


for each applicable model term using Generalized Gross Validation (GGV) (Wahba, 1990). 


For the TDboost model, it is not necessary to carry out data transformation, since the tree- 
based boosting method can automatically handle different types of data. For other models, 
we use logarithmic transformation on BLUEBOOK, i.e. log(BLUEBOOK), and scale all 
the numerical variables except for HOMEKIDS, KIDSDRIV, MVR_PTS and NPOLIGY to 
have mean 0 and standard deviation 1. We also create dummy variables for the categorical 
variables with more than two levels (GAR_TYPE, JOBGLASS and MAX_EDUG). For all 
models, we use the prohle likelihood method to estimate the dispersion cj) and the index p, 
which are in turn used in htting the hnal models. 


6.3 Performance comparison 

To examine the performance of TGLM, MGGV and TDboost, after htting on the training set, 
we predict the pure premium P(x) = //.(x) by applying each model on the independent held- 
out testing set. However, attention must be paid when measuring the differences between 
predicted premiums T’(x) and real losses y on the testing data. The mean squared loss 
or mean absolute loss is not appropriate here because the losses have high proportions of 
zeros and are highly right skewed. Therefore an alternative statistical measure - the ordered 
Lorenz curve and the associated Gini index - proposed by Frees et ah ( |2011 ) are used 
for capturing the discrepancy between the premium and loss distributions. By calculating 
the Gini index, the performance of different predictive models can be compared. Here we 


only briehy explain the idea of the ordered Lorenz curve (Frees et ah, 2011, 2013). Let 


R(x) be the “base premium”, which is calculated using the existing premium prediction 
model, and let T’(x) be the “competing premium” calculated using an alternative premium 
prediction model. In the ordered Lorenz curve, the distribution of losses and the distribution 
of premiums are sorted based on the relative premium R(x) = P(x)/R(x). The ordered 
premium distribution is 
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Dp{s) = 

and the ordered loss distribution is 


< s) 

Er=i5(x0 


DLis) = 


'ELiViHRi^i) < s) 

V V- 
z^i=i y* 


Two empirical distributions are based on the same sort order, which makes it possible to 
compare the premium and loss distributions for the same policyholder group. The ordered 
Lorenz curve is the graph of {Dp{s), Dl{s)). When the percentage of losses equals the 
percentage of premiums for the insurer, the curve results in a 45-degree line, known as “the 
line of equality”. Twice the area between the ordered Lorenz curve and the line of equality 
measures the discrepancy between the premium and loss distributions, and is defined as the 
Gini index. Curves below the line of equality indicate that, given knowledge of the relative 
premium, an insurer could identify the prohtable contracts, whose premiums are greater 
than losses. Therefore, a larger Gini index (hence a larger area between the line of equality 
and the curve below) would imply a more favorable model. 


Following Frees et ah (2013), we successively specify the prediction from each model as 


the base premium B{x.) and use predictions from the remaining models as the competing 
premium P(x) to compute the Gini indices. The entire procedure of the data splitting and 
Gini index computation are repeated 20 times, and a matrix of the averaged Gini indices 
and standard errors is reported in Table To pick the “best” model, we use a “minimax” 


strategy (Frees et ah, 2013) to select the base premium model that are least vulnerable to 


competing premium models; that is, we select the model that provides the smallest of the 
maximal Gini indices, taken over competing premiums. We hnd that the maximal Gini index 
is 15.528 when using B{x.) = as the base premium, 12.979 when B{x.) = 

and 4.000 when B{x.) = (x). Therefore, TDboost has the smallest maximum Gini 

index at 4.000, hence is the least vulnerable to alternative scores. Figure also shows that 
when TGLM (or MGGV) is selected as the base premium, the area between the line of 
equality and the ordered Lorenz curve is larger when choosing TDboost as the competing 
premium, indicating again that the TDboost model represents the most favorable choice. 
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Figure 7: The ordered Lorenz curves for the auto insurance claim data. 


Base Premium 

Gompeting Premium 

TGLM 

MGGV 

TDboost 

TGLM 

0 

7.833 (0.338) 

15.528 (0.509) 

MGGV 

3.044 (0.610) 

0 

12.979 (0.473) 

TDboost 

4.000 (0.364) 

3.540 (0.415) 

0 


Table 4: The averaged Gini indices and standard errors in the auto insurance claim data 
example based on 20 random splits. 


6.4 Interpreting the results 

Next, we focus on the analysis using the TDboost model. There are several explanatory 
variables signihcantly related to the pure premium. The VI measure and the baseline value 
of each explanatory variable are shown in Figure We hnd that REVOKED, MVR_PTS, 
AREA and BLUEBOOK have high VI measure scores (the vertical line), and their scores all 
surpass the corresponding baselines (the horizontal line-length), indicating that the impor¬ 
tance of those explanatory variables is real. We also find the variables AGE, JOBGLASS, 
GAR_TYPE, NPOLIGY, MAX_EDUG, MARRIED, KIDSDRIV and GAR_USE have larger- 
than-baseline VI measure scores, but the absolute scales are much less than aforementioned 
four variables. On the other hand, although the VI measure of, e.g., TRAVTIME is quite 
large, it does not signihcantly surpass the baseline importance. 

We now use the partial dependence plots to visualize the htted model. Figure shows 
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Figure 8: The variable importance measures and baselines of 17 explanatory variables for 
modeling the pure premium. 


the main effects of four important explanatory variables on the pure premium. We clearly 
see that the strong nonlinear effects exist in predictors BLUEBOOK and MVR_PTS: for 
the policyholders whose vehicle values are below 40K, their pure premium is negatively as¬ 
sociated with the value of vehicle; after the value of vehicle passes 40K, the pure premium 
curve reaches a plateau; Additionally, the pure premium is positively associated with mo¬ 
tor vehicle record points MVR_PTS, but the pure premium curve reaches a plateau when 
MVR_PTS exceeds six. On the other hand, the partial dependence plots suggest that a pol¬ 
icyholder who lives in the urban area (AREA= “URBAN”) or with driver’s license revoked 
(REVOKED=“YES”) typically has relatively high pure premium. 

In our model, the data-driven choice for the tree size is L = 7, which means that our 
model includes higher order interactions. In Figure we visualize the effects of four 
important second order interactions using the joint partial dependence plots. These four 
interactions are AREA x MVR_PTS, AREA x NPOLICY, AREA x REVOKED and AREA 
X TRAVTIME. These four interactions all involve the variable AREA: we can see that 
the marginal effects of MVR_PTS, NPOLICY, REVOKED and TRAVTIME on the pure 
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Figure 9: Marginal effects of four most significant explanatory variables on the pure premium. 

premium are greater for the policyholders living in the urban area (AREA= “URBAN”) 
than those living in the rural area (AREA= “RURAL”). For example, a strong AREA x 
MVR_PTS interaction suggests that for the policyholders living in the rural area, motor 
vehicle record points of the policyholders have a weaker positive marginal effect on the 
expected pure premium than for the policyholders living in the urban area. 

7 Conclusions 

The need for nonlinear risk factors as well as risk factor interactions for modeling insurance 
claim sizes is well-recognized by actuarial practitioners, but practical tools to study them 
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are very limited. In this paper, relying on neither the linear assnmption nor a pre-specihed 
interaction structure, a flexible tree-based gradient boosting method is designed for the 
Tweedie model. We implement the proposed method in a user-friendly R package “TDboost” 
that can make accurate insurance premium predictions for complex data sets and serve as a 
convenient tool for actuarial practitioners to investigate the nonlinear and interaction effects. 
In the context of personal auto insurance, we implicitly use the policy duration as a volume 
measure (or exposure), and demonstrate the favorable prediction performance of TDboost 
for the pure premium. In cases that exposure measures other than duration are used, which is 
common in commercial insurance, we can extend the TDboost method to the corresponding 
claim size by simply replacing the duration with any chosen exposure measure. 

TDboost can also be an important complement to the traditional GLM model in insurance 
rating. Even under the strict circumstances that the regulators demand the final model to 
have a GLM structure, our approach can still be quite helpful due to its ability to extract 
additional information such as non-monotonicity/non-linearity and important interaction. 
In Appendix Part E, we provide an additional real data analysis to demonstrate that our 
method can provide insights into the structure of interaction terms. After integrating the 
obtained information about the interaction terms into the original GLM model, we can much 
enhance the overall accuracy of the insurance premium prediction while maintaining a GLM 
model structure. 

In addition, it is worth mentioning that the applications of the proposed method can 
go beyond the insurance premium prediction and be of interest to researchers in many 


other fields including ecology ( 

Foster and Bravington 

2013 

), meteorology ( 

Dunn 

2004) 

and political science ( 

Lauderdale 

2012 

). See, for example. 

Dunn and Smyth] ( 

200f 

)) and 


Qian et al. (2015) for descriptions of the broad Tweedie distribution applications. The 


proposed method and the implementation tool allow researchers in these related fields to 
venture outside the Tweedie GLM modeling framework, build new flexible models from 
nonparametric perspectives, and use the model interpretation tools demonstrated in our real 
data analysis to study their own problems of interests. 
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